Ignore remote ES|QL execution failures when skip_unavailable=true #116365

smalyshev · 2024-11-06T21:50:08Z

Catch the error coming from the remote cluster and mark that cluster as PARTIAL. It looks like we need to do this in two places, since otherwise acquireAvoid will take over and cancel the whole task, which we don't want to happen.

At runtime, the following failures can happen and need to be covered:

- Cannot send message to remote (disconnect)
- Error when establishing exchange
- Error during computation
- Disconnect during computation
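
As an illustration of the idea described above, here is a minimal, self-contained sketch of a listener wrapper that records a remote failure and reports success when skip_unavailable=true, instead of propagating the failure. It is not the code from this PR: `ClusterStatus`, `ClusterInfo`, `Listener`, and `RemoteFailureHandling` are hypothetical stand-ins for the real Elasticsearch types, and the sketch ignores threading and task cancellation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are NOT the real Elasticsearch classes.
enum ClusterStatus { RUNNING, PARTIAL }

final class ClusterInfo {
    final String alias;
    final boolean skipUnavailable;
    ClusterStatus status = ClusterStatus.RUNNING;
    final List<Exception> failures = new ArrayList<>();

    ClusterInfo(String alias, boolean skipUnavailable) {
        this.alias = alias;
        this.skipUnavailable = skipUnavailable;
    }
}

// Simplified callback interface, assumed for this sketch only.
interface Listener<T> {
    void onResponse(T value);

    void onFailure(Exception e);
}

final class RemoteFailureHandling {
    /**
     * Wraps a delegate listener. If the cluster has skip_unavailable=true,
     * a failure is recorded on the cluster, the cluster is marked PARTIAL,
     * and the delegate is completed successfully. Otherwise the failure is
     * passed through unchanged.
     */
    static Listener<Void> wrap(ClusterInfo cluster, Listener<Void> delegate) {
        return new Listener<Void>() {
            @Override
            public void onResponse(Void value) {
                delegate.onResponse(value);
            }

            @Override
            public void onFailure(Exception e) {
                if (cluster.skipUnavailable) {
                    cluster.status = ClusterStatus.PARTIAL;
                    cluster.failures.add(e);
                    delegate.onResponse(null);
                } else {
                    delegate.onFailure(e);
                }
            }
        };
    }
}
```

Each of the four failure modes listed above would need to reach a wrapper like this for the cluster to end up PARTIAL rather than failing the whole query.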

See also: #112886

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeListener.java

elasticsearchmachine · 2024-11-13T23:53:39Z

Hi @smalyshev, I've created a changelog YAML for you.

quux00

First pass review with some questions and comments.

.../esql/compute/src/main/java/org/elasticsearch/compute/operator/exchange/ExchangeService.java

quux00 · 2024-11-14T21:43:20Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeListener.java

+
+    /**
+     * Marks the cluster as PARTIAL and adds the exception to the cluster's failures record.
+     * Currently, additional failures are not recorded, TODO: check if this should be the case.


It's OK to add more failures to the cluster metadata list. In search when shard level searches occur, a cluster might have multiple failures listed in the array, so feel free to do that here if there is a use case for it.

OK, I'll maybe do it later. For now I want to deal with a single failure first, without the complication of handling more than one.
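
The trade-off discussed in this thread (record only the first failure versus appending every failure) can be sketched as follows. This is a hypothetical illustration, not the PR's implementation: `ClusterFailureRecord` and its `keepAll` flag are names invented for this example.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper, assumed for illustration only.
final class ClusterFailureRecord {
    private final AtomicBoolean partial = new AtomicBoolean(false);
    private final List<Exception> failures = new CopyOnWriteArrayList<>();
    private final boolean keepAll;

    /** @param keepAll if true, every failure is recorded; if false, only the first one is. */
    ClusterFailureRecord(boolean keepAll) {
        this.keepAll = keepAll;
    }

    /** Marks the cluster as partial and records the failure according to the keepAll policy. */
    void markAsPartial(Exception e) {
        // compareAndSet succeeds only for the first caller, even under concurrency.
        boolean first = partial.compareAndSet(false, true);
        if (first || keepAll) {
            failures.add(e);
        }
    }

    boolean isPartial() {
        return partial.get();
    }

    /** Returns an immutable snapshot of the recorded failures. */
    List<Exception> failures() {
        return List.copyOf(failures);
    }
}
```

With `keepAll` set to false the record behaves like the single-failure approach described in the reply; setting it to true corresponds to the reviewer's suggestion of listing multiple failures.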

quux00 · 2024-11-14T21:59:07Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeListener.java

 import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicBoolean;

+import static org.elasticsearch.xpack.esql.session.EsqlSessionCCSUtils.markClusterEmptyInfo;


My thinking about the EsqlSessionCCSUtils is that it would be specific to plan-time operations and not used at execution time (that's why I renamed it from Costin's CcsUtils to EsqlSessionCCSUtils). So I would want us to think about if there are enough "helper" methods needed at execution time to create a "ComputeCCSUtils"? If not, then we should rename EsqlSessionCCSUtils to maybe just "EsqlCCSUtils".

Right now there's one method only - one that creates that "empty" cluster state (which is yes, not empty, see below) but if there's more then we could move it. I don't want to create another utils just for one method though.

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeListener.java

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeService.java

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSessionCCSUtils.java

elasticsearchmachine · 2024-11-20T22:10:16Z

Hi @smalyshev, I've updated the changelog YAML for you.

dnhatn

I've reviewed the ComputeService and ComputeListener. I think we're getting closer. Thanks for your iterations on this @smalyshev.

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeService.java

dnhatn · 2024-12-03T06:31:14Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeService.java

+            };
+            // Cancel the group on sink failure
+            ActionListener<Void> exchangeListener = computeListener.acquireAvoid().delegateResponse((inner, e) -> {
+                taskManager.cancelTaskAndDescendants(groupTask, "exchange sink failure", true, ActionListener.noop());


I think we should wait for the cancellation here. I think we should also fix the same issue in ComputeListener. Can you move this into the cancellation listener?

if (suppressRemoteFailure) {
    computeListener.markAsPartial(clusterAlias, e);
    taskManager.cancelTaskAndDescendants(groupTask, "exchange sink failure", true, ActionListener.running(() -> inner.onResponse(null)));
} else {
    inner.onFailure(e);
}

Wait, wouldn't that code not cancel task on failure?
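
The ordering at issue in this thread (complete the inner listener only after cancellation has finished) can be shown with a small stand-alone sketch. It does not use the real `TaskManager` or `ActionListener`: `Canceller` and `CancelThenComplete` are hypothetical, and an event list stands in for the real side effects. Note that, as in the suggestion above, the non-suppressed branch of this sketch does not cancel anything itself.

```java
import java.util.List;

final class CancelThenComplete {
    /** Hypothetical cancellation API: runs onDone once the cancellation has finished. */
    interface Canceller {
        void cancel(String reason, Runnable onDone);
    }

    /**
     * Handles a sink failure. When failures are suppressed, the cluster is marked
     * partial first, and completion is signalled only from the cancellation callback.
     * Otherwise the failure is reported directly.
     */
    static void onSinkFailure(boolean suppressRemoteFailure, Exception e, Canceller canceller, List<String> events) {
        if (suppressRemoteFailure) {
            events.add("marked partial");
            canceller.cancel("exchange sink failure", () -> events.add("completed"));
        } else {
            events.add("failed: " + e.getMessage());
        }
    }
}
```

Because "completed" is added from inside the callback, it cannot be observed before the canceller reports that it is done, which is the property the reviewer asks for.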

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeService.java

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSessionCCSUtils.java

quux00 · 2024-12-03T14:57:08Z

...plugin/esql/src/test/java/org/elasticsearch/xpack/esql/session/EsqlSessionCCSUtilsTests.java

            assertClusterStatusAndShardCounts(remote2Cluster, EsqlExecutionInfo.Cluster.Status.SKIPPED);
        }

+        // skip_unavailable=true clusters are unavailable, both marked as PARTIAL


I don't think this test belongs here?

This is in the test for testUpdateExecutionInfoWithUnavailableClusters, which only ever sets status to SKIPPED, not PARTIAL.

This test should go in a different method (not sure which, though).

...l/src/internalClusterTest/java/org/elasticsearch/xpack/esql/action/CrossClustersQueryIT.java

smalyshev · 2025-01-30T23:09:18Z

Superseded by #121240

elasticsearchmachine added the v9.0.0 label Nov 6, 2024

quux00 reviewed Nov 7, 2024

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeListener.java (outdated, resolved)

quux00 reviewed Nov 7, 2024

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeListener.java (outdated, resolved)

quux00 reviewed Nov 7, 2024

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeListener.java (resolved)

smalyshev force-pushed the skip-on-fail branch 2 times, most recently from 72a0519 to a9d0b09 Compare November 7, 2024 20:26

smalyshev changed the title ~~Ignore failures on skip_unavailable~~ Ignore remote ES|QL execution failures when skip_unavailable=true Nov 12, 2024

smalyshev added :Analytics/ES|QL AKA ESQL >enhancement labels Nov 13, 2024

smalyshev and others added 6 commits November 13, 2024 16:55

Ignore failures on skip_unavailable

318d424

Cover more cases for skip

378f71a

Fix existing tests by defaulting skip_un to false

dc8ec22

Add tests for both skip_un settings

dc1e4fa

Add handling remote sink failures

6c09cb4

Update docs/changelog/116365.yaml

2842278

smalyshev force-pushed the skip-on-fail branch from d7d55f5 to 2842278 Compare November 13, 2024 23:55

smalyshev added 3 commits November 14, 2024 08:54

Merge branch 'main' into skip-on-fail

22c8bf7

Enable runtime missing index tests

1135803

More runtime missing index tests

a61ed17

smalyshev added the v8.17.0 label Nov 14, 2024

quux00 reviewed Nov 14, 2024

Add cancellation/shutdown tests

9a81820

quux00 mentioned this pull request Nov 19, 2024

ESQL: CCS skip_unavailable testing for non-matching index expressions under RCS2 #116846

Merged

smalyshev added 3 commits November 19, 2024 11:05

Merge branch 'main' into skip-on-fail

944df10

Merge branch 'main' into skip-on-fail

9104b47

Fix utils test

7c25b35

elasticsearchmachine added v8.18.0 and removed v8.17.0 labels Nov 20, 2024

Merge branch 'main' into skip-on-fail

6f035a2

smalyshev requested a review from nik9000 December 2, 2024 19:50

Test fixes

441b390

dnhatn self-requested a review December 3, 2024 00:57

dnhatn reviewed Dec 3, 2024

dnhatn self-requested a review December 3, 2024 06:43

quux00 reviewed Dec 3, 2024

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSessionCCSUtils.java (outdated, resolved)

quux00 reviewed Dec 3, 2024

...l/src/internalClusterTest/java/org/elasticsearch/xpack/esql/action/CrossClustersQueryIT.java (resolved)

smalyshev and others added 7 commits December 3, 2024 10:49

Pull feedback

1e73b14

Merge branch 'main' into skip-on-fail

a32de43

Merge branch 'main' into skip-on-fail

5633936

Merge branch 'main' into skip-on-fail

496c273

Merge branch 'main' into skip-on-fail

c0a677b

Merge branch 'main' into skip-on-fail

8a3a5ee

spotless

cbae222

smalyshev added the auto-backport Automatically create backport pull requests when merged label Jan 2, 2025

smalyshev added 2 commits January 14, 2025 12:21

Merge branch 'main' into skip-on-fail

ff16ebf

fix test

1f1601b

smalyshev force-pushed the skip-on-fail branch from b0ee146 to 1f1601b Compare January 14, 2025 19:45

smalyshev added 2 commits January 14, 2025 12:56

Move test

4c19802

Fix test

4f360ef

elasticsearchmachine added v8.19.0 v9.1.0 and removed v8.18.0 v9.0.0 labels Jan 30, 2025

smalyshev marked this pull request as draft January 30, 2025 18:50

dnhatn removed their request for review January 31, 2025 18:46

smalyshev closed this Jan 31, 2025

4 participants
